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Abstract 

In this paper, we study the problem of determining a minimum state probabilistic finite state 
machine capable of generating statistically identical symbol sequences to samples provided. This 
problem is qualitatively similar to the classical Hidden Markov Model problem and has been 
studied from a practical point of view in several works beginning with the work presented in: 
Shalizi, C.R., Shalizi, K.L., Crutchfield, J.P. (2002) An algorithm for pattern discovery in time 
series. Technical Report 02-10-060, Santa Fe Institute, arxiv.org/abs/cs.LG/0210025. In this 
paper, we show that the underlying problem is NP-hard and thus all existing polynomial time 
algorithms must be approximations on finite data sets. Using our NP-hardness proof, we show 
how to construct a provably correct algorithm for constructing a minimum state probabilistic 
finite state machine given data and empirically study its running time. 


1 Introduction 

Hidden Markov models (HMMs) are a pattern recognition tool used to construct a Markov model 
when the state space is unknown. HMMs are used for a wide variety of purposes, such as speech 
recognition pQ, handwriting recognition [2 E|, and tracking 0. Given a process which produces 
some string of training data, there are many algorithms that are widely used to infer a Hidden 
Markov model for the process. In this paper, we focus on the approach developed by Shalizi [5f 
for producing e-machines from a string of input data, which can be thought of as a kind of HMM. 
This work is extended in mm and m. 

Shalizi’s approach to constructing a HMM is to find statistically significant groupings of the 
training data which then correspond to causal states in the HMM. In this formulation, each state 
is really an equivalence class of unique strings. This algorithm groups unique strings based on the 
conditional probabilities of the next symbol as a window slides over the training data. The window 
gradually increases in length up to a maximum length L, which is the maximum history length that 
contains predictive power for the process. This approach is called the Casual State Splitting and 
Reconstruction (CSSR) algorithm. The result of the CSSR algorithm is a Markov model where each 
state consists of a set of histories of up to length L that all share the same conditional probability 
distributions on the next symbol. 

Assuming an input string of infinite length where the conditional probability distributions con¬ 
verge to their true values, the CSSR algorithm produces a minimal-state e-machine for the given 
process. That is, any time two states could be combined while still maintaining the desired proper¬ 
ties of the e-machine, they are. However, for input sequences of finite length, the CSSR algorithm 
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does not guarantee a minimal-state machine due to the technique by which the strings are grouped. 
Consequently, state explosion can occur. This paper seeks to develop a new approach, based on the 
CSSR algorithm, which always guarantees a minimal-state HMM. We use a very similar idea to the 
CSSR algorithm, but formulate it as an integer programming problem whose goal is to minimize 
the number of states. We then provide a reformulation of the integer programming problem as a 
minimal clique covering problem, which provides a faster algorithm for finding the minimal-state 
HMM. 

2 Notation and Preliminaries 

Let A be a finite set of symbols or alphabet representing the observable events of a given system. 
Let y be the symbolic output of our system, so y is a sequence composed of the elements in A. We 
will write y = y\y 2 —Vn to denote the individual elements in y, so y* G A. 

Let A* be the set of all sequences composed of symbols from A, and let A be the empty sequence. 
If y is a sequence, then x is a subsequence if 

1. x is a sequence with symbols in A and 

2. There are integers i and k such that x = ytyi+i ■ ■ ■ yt+k- 

The subsequences of y of length k constitute the sliding data windows of length k over y. 

For us, a hidden Markov model is a tuple G = (Q , A, S,p), where Q is a finite set of states, A is 
a finite alphabet, dCQx^lxQisa transition relation, and p : 5 —> [0,1] is a probability function 
such that 

Y P{<h a ,<l') = l VgeQ (1) 

aeA,q'€Q 

This is slightly different than the standard definition of an HMM as observed in [lj because we 
are particularly interested in the state transition relation and associated probabilities, rather than 
the symbol probability distribution for a given state. 

The Baum-Welch Algorithm is the standard expectation maximization algorithm used to de¬ 
termine the transition relation and probability function of a hidden Markov model. However, this 
algorithm requires an initial estimate of the transition structure, so some initial knowledge of the 
structure of the Markov process governing the dynamics of the system must be known. 

Given a sequence y produced by a stationary process, the Causal State Splitting and Recon¬ 
struction (CSSR) Algorithm infers a set of causal states and a transition structure for a hidden 
Markov model that provides a maximum likelihood estimate of the true underlying process dynam¬ 
ics. In this case, a causal state is a function mapping histories to their equivalence classes, as in 

M- 

The states are defined as equivalence classes of conditional probability distributions over the 
next symbol that can be generated by the process. The set of states found in this manner accounts 
for the deterministic behavior of the process while tolerating random noise that may be caused by 
either measurement error or play in the system under observation. The CSSR Algorithm has useful 
information-theoretic properties in that it attempts to maximize the mutual information among 
states and minimize the remaining uncertainty (entropy) (9]. 

The CSSR Akjorithnr is straightforward. We are given a sequence y € A* and know a priori 
the value L G NJH For values of i increasing from 0 to L, we identify the set of sequences W that 
are subsequences of y and have length i. (When i = 0, the empty string is considered to be a sub¬ 
sequence of y.) We compute the conditional distribution on each subsequence x G W and partition 

1 There are some heuristics for choosing L found in [9] 
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the subsequences according to these distributions. These partitions become states in the inferred 
HMM. If states already exist, we compare the conditional distribution of subsequence x to the 
conditional distribution of an existing state and add x to this state if the conditional distributions 
are ruled identical. Distribution comparison can be carried out using either a Kolmogorov-Smirnov 
test or a x 2 test with a specified level of confidence. The level of confidence chosen affects the 
Type I error rate. Once state generation is complete, the states are further split to ensure that the 
inferred model has a deterministic transition relation <5. Algorithm |T] shows the CSSR Algorithm. 


Algorithm 1 - CSSR Algorithm 

Input: Observed sequence y; Alphabet A, Integer L 

Initialization: 

1: Define state go and add A (the empty string) to state go. Set Q = {go}. 

2: Set N := 1. 

Splitting (Repeat for each i < L) 

1: Set W = {x|3g £ Q(x £ g A |x| = i — 1)} (The set of strings in states of the current model with length equal to 
i- 1} 

2: Let N be the number of states. 

3: for each x £ W do 

4: for each a € A do 

5: if ax is a subsequence of y then 

6: Determine f ax \ y : A [0, 1], the probability distribution over the next input symbol. 

7: Let f q | y : A —> [0,1] be the unified state conditional probability distributions; that is, the probability 

given the system is in state g,;, that the next symbol observed will be a. For each j, compare f q . | y 
with f ax .\y using an appropriate statistical test with confidence level a. Add ax to the state that has 
the most similar probability distribution as measured by the p-value of the test. If all tests reject the 
null hypothesis that f q .\ y and f ax | y are the same, then create a new state gjv+i and add ax to it. Set 
N := N + 1. 

8: end if 

9: end for 

10: end for 
Reconstruction 

1: Let No = 0. 

2: Let N be the number of states. 

3: while No ^ N do 
4: for each i £ 1,..., N do 

5: Set k := 0 

6: Let M be the number of sequences in state qt. Choose a sequence xo from state qi. 

7: Create state p ;k and add xo to it 

8: for all sequences Xj (j > 0) in state g,: do 

9: For each a £ A Xj a produces a sequence that is resident in another state gj,. Let S(xj, a) = qt- 

10: For l = 0,..., k, choose x from pik . If S(xj, a) = 5(x, a) for all a £ A, then add Xj to Pik■ Otherwise, 

create a new state Pik+i and add x, to it. Set k := k + 1. 

11: end for 

12: end for 

13: Reset Q = {pik}', recompute the state conditional probabilities f q | y for g £ Q and assign transitions using the 

5 functions defined above. 

14: Set IVo = N. 

15: Set N to be the number of states in our current model. 

16: end while 

17: The model of the system has state set Q and transition probability function computed from the 5 functions and 
state conditional probabilities. 


The complexity of CSSR is 0{k 2L+1 )+ 0(A), where k is the size of the alphabet, L is the 
maximum subsequence length considered, and N is the size of the input symbol sequence[9]. Given 
a stream of symbols y, of fixed length N, from alphabet A, the algorithm is linear in the length of 
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the input data set, but exponential in the size of the alphabet. 


2.1 A Practical Problem with CSSR 

As observed in Shalizi and Shalizi (2004), the CSSR converges to the minimum state estimator 
asymptotically [9] , however it does not always result in a correct minimum state estimator for finite 
a finite sample. With an infinitely long string, all of the maximum likelihood estimates converge 
to their true value and the CSSR algorithm works correctly. That is, as N — > oo, the probability 
of a history being assigned to the wrong equivalence class approaches zero. A formal proof of this 
is given in [5] and relies on large deviation theory for Markov chains. 

The error in estimating the conditional probabilities with strings of finite length can cause the 
CSSR algorithm to produce a set of causal states that is not minimal. 

The following example will clarify the problem. Consider A = {0,1} and y defined as follows: 


Hi = 0 1 < i < 518 


[yi- 3 Vi-2 Vi- 1 Vi) = 1100 518 < i < 582 
[Vi -2 Vi- 1 Vi] = 100 582 < i < 645 
[m -2 Vi- 1 Vi] = 101 645 < i < 648 


( 2 ) 


Without loss of generality, we consider exclusively strings of length two. The strings, in order 
of appearance, are {00,01,11,10}. The conditional probability distribution on the next element for 
each string is shown in the following table: 


Table 1: Conditional Probabilities 


string 

Pr(0 string) 

Pr(l string) 

00 

0.9314 

0.0686 

01 

0.5789 

0.4211 

11 

1.0 

0 

10 

0.9737 

0.0263 


And the following p-values associated with comparing each string’s conditional probability dis¬ 
tribution to that of every other string: 


Table 2: p-values 


string 

00 

01 

11 

10 

00 

1.0000 




01 

0 

1.0000 



11 

0.0067 

0 

1 


10 

0.0944 

0 

0.7924 

1 


We now step through the Splitting phase of the CSSR algorithm: 

1. String 00 is put into its own state, q\ 

2. We use Pearson’s Chi-Squared test to test if the 01 and 00 have the same conditional distri¬ 
bution. This results in a p -value of approximately 0, as seen Table [2] so string 01 is put into 
a new state, q 2 
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3. We compare the conditional distribution of 11 to string 00. This results in a p -value of 0.0067, 
and comparing it to 01 results in a p-value of 0 , so string 11 is put into a new state, q 3 

4. We compare the conditional probability distribution of 10 to that of 00. This results in 
a p-value of 0.0944. Comparing it to 00 results in a p-value of 0, and comparing it to 11 
results in a p-value of 0.7924. Since 0.7924 is greater than 0.0944 and greater than the chosen 
significance level of 0.05, the string 10 is clustered with string 11, so q 3 = {11,10}. 

Thus, at the end of the Splitting step of CSSR, we are left we three different states: 

9 1 = { 00 } 

92 = {01} 

93 = {H, 10} 

We now go on to the Reconstruction step. Let Xij be a state in < 7 * where j £ 1, 2,..., |qj|. The 
Reconstruction step checks whether, for each a £ A, <5(x*i ,a) = < 5 (x,; 2 ,a) = ... = q &. Recall that 
5(xjj,a) = qk means that x^a produces a sequence that resides in state qk■ If this is not satisfied, 
then state qi must be broken up into two or more states until we have a deterministic transition 
relation function. 

In our example, the first two states do not need to be checked since they each only consist of one 
string. We check the third state: <5(11,0) = q 3 since string 10 £ < 73 , and <5(10,0) = q\ since string 
0 £ q\. Thus, determinism is not satisfied and state <73 must be split into two states. The result of 
the CSSR algorithm is a four-state Markov chain where each string resides in its own state. 

Now we will show why this result is not the minimal state Markov generator for this input 
sequence. Suppose that during the Splitting step, string four was put into state q\ instead of 53 . 
This could have been done soundly from a statistical point of view, since 0.0944 is also greater than 
our alpha-level of 0.05. However, by the CSSR heuristic, the fourth string was grouped according 
to the highest p-value. If the fourth string were put in < 71 , then after the Splitting step we would 
have gotten the following states: 


qi = { 00 , 10 } 

92 = {01} 

93 = {11} 

Now we move on to the Reconstruction step. This time, we only need to consider state < 71 . Notice 
that <5(10,0) = <5(00,0) = <71 and <5(00,1) = <5(10,1) = < 72 • We see that state qi does satisfy 
determinism, and does not need to be split. Thus, our final number of states is only three, which 
is minimal. This provides us with motivation to reformulate the CSSR algorithm so that it always 
produces a the minimal number of states, even when given a small sample size. This paper seeks 
to address this issue and propose a reformulation of the CSSR algorithm which always results in 
minimal state models even with a finite sample. 

3 Integer Programming Formulation 

Much of this section has been adapted from Cluskey (2013) [TO]. Suppose we have a string y £ A 
and we assume that L is known. Let W be the set of distinct strings of length L in y, and let 
\W\ = n. When we cluster the elements of W into states we must be sure of two things: 1) That 
all strings in the same state have the same conditional distributions, and 2 ) that the resulting 
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transition function is deterministic. Our goal is to formulate an integer programming problem that 
minimizes the number of states and preserves the two conditions above. Define the following binary 
variables: 

1 string i maps to state j 
0 else 


Xjn - 


(3) 


Remark 1. We assume implicitly that there are n states, since n is clearly the maximum number 
of states that could be needed. If, for some j , x t j = 0 Vi, then that state is unused, and our true 
number of states is less than n. 

Let: 


zu = 


1 string i maps to string j upon symbol a 
0 else 


(4) 


For example, 4 = 1 if i = (x\, ...,xl) and l = ( X 2 ,... : xl,o). Assuming j and k are state indices 
while i and l are string indices, we also define as 

{x^ = 1) A (4 = 1) A ( xi k = 1) =► (yj k = 1) 

This variable ensures the encoded transition relation maps each state to the correct next state 
given a certain symbol. Specifically, if string i in clustered in state j and string i transforms to 
string l given symbol a, and string l is clustered in state k, then (j, a, k) must be the transition 
relation. This can be written as the constraint 


(1 - Xij) + (1 - 4) + (1 - x tk ) + y° k > 1 Vi, j, k , l, 
To achieve determinism, we must have the condition that 

yjk < 1 Vj> 


a 


This condition ensures that for a given symbol, each state can transition to only one other state. 

We ensure that the strings in each state have the same conditional probability distributions 
using a parameter /i, where 


Ril = 


1 strings i and l have identical conditional probability distributions 
0 else 


(5) 


In order to determine if two strings have identical distributions an appropriate statistical test (like 
Pearson’s y 2 or Kolmogorov-Smirnov) must be used. The variables must satisfy 

(x^ = 1) A (xij = 1) =>- {nu = 1) 

This states that strings i and l can only belong to the same state if they have statistically indistin¬ 
guishable distributions. This can be written as the following constraint 

(1 - x^) + (1 - xij) + mi > 1 Vz, Z, j 

Finally, we define the variable p to be used in our objective function. Referring back to the 
remark under [3l this variable simply enumerates up the number of “used” states out of n. That is: 


Pj = 


0 £* x^ = 0 

1 else 


( 6 ) 
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This can be enforced as the following constraint 

Pi > Pi € {0,1} 

Note that < 1. These constraint are equivalent to Equation [6] because when Yli x ij = 0’ 

Pj will be 0, and when Y^i x ij > 0, Pj will be 1. We will minimize ^2jPj in our optimization 
problem. This addition extends the work done in m , in which a series of optimization problems 
had to be solved. 


Definition 1 (Minimum State Deterministic pFSA Problem). The following binary integer pro¬ 
gramming problem, which, if feasible, defines a mapping from strings to states and a probability 
transition function between states which satisfies our two requirements. 


P(N) 


min E Pj 

j 

s.t. (1 - Xij) + (1 - 4) + (1 - xik ) + Vj k > 1 Vi, j, k, l, cT 

Y.y'j ^ 1 V -?> 

k 

(1 - Xij) + (1 - Xij) + Hu > 1 Vi,Z,j 

\ X/i X ij y/ ■ ■ 

Pj > s Vz,J 

= i ^ 

j 


(7) 


and is called the Minimum State Deterministic pFSA (MSDpFSA) Problem. 
The following proposition is clear from the construction of Problem [7J 


Proposition 3.1. Any optimal solution to MSDpFSA yields an encoding of a minimum state 
probabilistic finite machine capable of generating the input sequences with statistically equivalent 
probabilities. 


Remark 2. Because this formulation does not rely on the assumption of infinite string length to 
correctly minimize the number of states, it succeeds where the CSSR algorithm fails. Instead 
of clustering strings into the state with the most identical conditional probability distributions, 
this optimization problem uses the identical distributions as a condition with the goal of state 
minimization. Thus, in the example, even though the fourth string had a higher p-value when 
compared to the third state than the first state, since both of the p-values are great than 0.05 
this algorithm allows the possibility of clustering the fourth string into either state, and chooses 
the state which results in the smallest number of final states after reconstruction. Unlike CSSR, 
this algorithm does not rely on asymptotic convergence of the conditional probabilities to their 
true value. However, it is clear that as the sample size increases the probability distributions will 
become more exact, leading to better values of /i. 


3.1 Resolution to the Example Problem 

Using the integer program formulation to solve the example previously presented does result in the 
minimum state state estimator of the process represented by the given input string. The integer 
program finds the following solution for the variables z, x, y , u, and p seen in Tabled Our objective 
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Table 3: Integer Program Variables 


(a) x values (b) u values (c) z° values (d) z 1 values 


- 

Qi 

Q2 

Q3 

94 



00 

01 

11 

10 


00 

01 

11 

10 



00 

01 

11 

10 

00 

1 

0 

0 

0 


00 

1 

0 

0 

1 

00 

1 

0 

0 

0 


00 

0 

1 

0 

1 

01 

0 

1 

0 

0 


01 

0 

1 

0 

0 

01 

0 

0 

0 

1 


01 

0 

0 

1 

0 

11 

0 

0 

1 

0 


11 

0 

0 

1 

1 

11 

0 

0 

0 

1 


11 

0 

0 

1 

0 

10 

1 

0 

0 

0 


10 

1 

0 

1 

1 

10 

1 

0 

0 

0 


10 

0 

1 

0 

0 


(e) y° values (f) y 1 values (g) p values 



91 92 93 



9i 92 93 

91 

1 0 0 


91 

0 1 0 

92 

1 0 0 


92 

0 0 1 

93 

1 0 0 


93 

0 0 1 


Table 4: State Transition Probabilities 



9i 

92 

93 

9i 

0.9341 

0.0659 

0 

92 

0.5789 

0 

0.4211 

93 

1 

0 

0 


function is to minimize the sum of the variables pj. As can be seen in Table (3J there is only one 
nonzero column in the x matrix (the last column). Thus, only p 4 = 0, and the the value of our 
objective function is 3. 

We can also recover a matrix of transition probabilities shown in Table [H Our final reduced 
state space machine is shown in Figure [I] 


0.9341 



Figure 1: The reduced state space machine derived from the input given by Equation [2] 


4 Computational Complexity of MSDpFSA 

In this section we assert that the integer program given by Problem [7] is NP-hard by reducing a less 
complex problem to the minimal clique covering problem. Recall given a graph, a minimal clique 
covering problem is to identify a set of cliques in the graph so that each vertex belongs to at least 
one clique and so that this set of cliques is minimal in cardinality. 

While the CSSR algorithm attempts to find a minimal state deterministic finite-state automata, 

































determinism is not a necessary property for accurate reconstruction. For a given input sequence, the 
construction of a finite-state automata which is not necessarily deterministic is less computationally 
intensive than solving the MSDpFSA problem, and is discussed by Schmiedekamp et al. (2006) 
Em- Like the CSSR algorithm, the algorithm presented in[l 1 ] is a heuristic which does not always 
results in a minimal state probabilistic FSA. The following definition provides an integer program 
for optimally solving a minimal state pFSA which is not necessarily deterministic. This is identical 
to Problem [7] except that we discard the constraints that ensure determinism. 

Definition 2 (Minimum State Non-Deterministic pFSA Problem). The following binary integer 
programming problem, which, if feasible, defines a mapping from strings to states and a probability 
transition function between states which is not necessarily deterministic. 


" min E Pj 

j 


P'(N) = < 


s.t. (1 - Xij) + (1 - xij) + mi > 1 Vi,l,j 
Yli x ij 


Pj > 


s 


Vi,j 


— i Vz 


j 

and is called the Minimum State Non-Deterministic pFSA (MSNDpFSA) Problem. 
Proposition 4.1. MSNDpFSA is NP-Hard. 


( 8 ) 


Proof. Let G = (V, E) be the graph on which we want to find the minimal clique covering. Assume 
that V = {1, 2 ,..., n} and let / be a n x n matrix such that Iij = 1 <;=> there is an edge connecting 
vertices i and j. We reduce Problem [ 8 ] by letting n be the number of unique strings of length L 
in y, so we can let each string correspond to a vertex of G. Let 1^ = 1 IMj = 1- This 

means that two strings are connected if and if only if they have identical conditional probability 
distributions. We show that Problem [ 8 ] is equivalent to finding a minimal clique cover of G where 
")T pj is the number of cliques. Let the set of cliques corresponding to the minimal clique cover of 
G be C = {Ci,..., C m }, where Cj is a specific clique, and Cj = Vj CV. 

We can define the variables x by Xj-j = 1 V) G Cj. Thus, the constraint that (1 — x^) + 

(1 — xij) + pu > 1 Vz, l,j simply means that if two vertices are in the same clique then there must 
be an edge between them. We also have that pj = 0 iff there is at least one vertex in clique j, so 
the set of j such that pj = 1 corresponds to the non-empty cliques, i.e., is identical to C (since 
C only consists of non-empty cliques). Thus, minimizing ~2jPj is equivalent to minimizing the 
number of cliques needed to cover G. The constraint that JA = 1 Mi simply means that each 
vertex belongs to exactly one clique. Thus, it is clear that the integer program given in Problem [ 8 ] 
is equivalent to a minimal clique covering and is NP-hard. □ 


Remark 3. A proof of the NP-hardness of Problem [ 8 ] would be similar to the proof of Proposition 
HU except it would include the deterministic element of the resulting probabilistic finite state 
machine. This does not add substantially to the proof and the computational complexity of the 
problem should not decrease when passing to the deterministic case because of the added degree 
of complexity due to the enforcement of determinism. 


4.1 Example of Reduction 

We will illustrate the equivalence of Equation [ 8 ] and the minimal clique covering through an example. 
Using the same example as before, the resulting graph G is shown in Figure EJ The string 00 and 
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10 have an edge in common because they have identical conditional distributions as noted by Table 
[2j By the same reasoning, 10 and 11 also have an edge in common. However, 00 and 11 do not 
share an edge, and 01 has no incident edges. 

The integer program given by Equation [8] produces either of the following two results: 

Q = {qi = {00,10}, q 2 = {11}, <73 = {01}} or 
Q' = Ui = {11,10}, q 2 = {00}, q 3 = {01}} 

Both of these two groupings results in one of two minimal clique coverings. Let V\ = 00, V 2 = 
10, = 11, V 4 = 01. 


C = {C = {1,2}, C 2 = {3}, C 3 = {4}} or 
C' = {C 1 = { 2,3}, C 2 = { 1}, C 3 = {4}} 



Figure 2: The graph G corresponding to the string y given in Equation [2j 



Figure 3: The graph G considered by the CSSR algorithm 


When the determinism constraint is added to the integer program, the result is Q' (or equiva¬ 
lently C') instead of Q (or C ). The solver recognizes that Q (or C ) would have to be split in order 
for determinism to be satisfied, so Q' (or C') is chosen instead. If we think of the CSSR algorithm 
in terms of our graph equivalency, the CSSR algorithm does not consider the existence of an edge 
between 00 and 10. Let T be Table [21 For a vertex i, CSSR only places an edge between i and 
argmax J y i {T(z, j) : T(i,j ) > 0.05}.Thus, by the CSSR algorithm, the graph is really the image 
shown in Figure 0 In this formulation, there is only one minimal state clustering (minimal clique 
covering), G (or C ) which does not satisfy determinism, so G is split into four states. 

5 Minimal Clique Covering Formulation of CSSR Algorithm 

In this section we present an algorithm for determining the minimal state hidden Markov model 
using a minimal clique vertex covering reformulation. As shown in the previous section, the two 
problem formulations are equivalent if we exclude the determinism constraint. Let W = {Si, ..., S n } 
be the set of unique strings of length L in A, so W is the set of vertices in our graph formulation. 
In the revised algorithm, we first use the CSSR Algorithm, Algorithm [l] to find an upper bound 
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on the number of cliques needed. We then use Bron-Kei'bosch algorithm to enumerate all maximal 
cliques, C = {C±, C 2 , C m }, given as Algorithm 121 [12]. 

Define H := argmin^{|i?| : R C C A U {Ri = W}. Thus H is the set of maximal cliques of 
minimal cardinality such that every vertex belongs to at least one clique. This can be found using 
a simple binary integer program given in Algorithm [3j We can think of Algorithm [3] as defining a 
mapping from every string to the set of all maximal cliques. A clique is “activated” if at least one 
string is mapped to it. The goal, then, is to activate as few cliques as possible. Of course, each 
string can only be mapped to a clique in which it appears. We define a few variables, which are 
present in our integer program: 


Vi = 



Ci£H 

else 


The variable y can be thought of as whether or not clique Ci is activated. 


(9) 


lij 


1 Si eCj <E H 
0 else 


( 10 ) 


s ij — 



Siis mapped to Cj G H 
else 


( 11 ) 


To distinguish between I and s, notice that each string is only mapped to one clique, but each 
string could appear in more than one clique. 


Proposition 5.1. The set H is a minimal clique vertex covering. 

Proof. Suppose there is a minimal clique vertex covering of a graph G with k cliques such that every 
clique is not maximal. Choose a non-maximal clique and expand the clique until it is maximal. 
Continue this procedure until every clique is maximal. Let our set of cliques be called R. Clearly 
i?.| = k since no new cliques were created. Also note that R C C since R consists of maximal 
cliques. Further, U iRt = W since we started with a clique vertex covering. Finally, notice that 
R = argmin/f {\H\ \ H C C A U j-ffj = W} because \R\ = k and k is the clique covering number of 
G, so it is impossible to find an H with cardinality less than k. □ 


Proposition 5.2. Any solution to Q(C) results in a minimal clique vertex covering. 

Proof. This is clear by noting that argrnin Q(C) results in the subset, H, of C = {Ci, C2, ..., C m } 
where H is defined as argmin^{|i?| : R C C and U* Rj = W} in conjunction with Proposition 

O □ 

Before discussing the rest of the algorithm, which concerns determinism, we first state an 
important remark about Algorithm [3j It is possible that the set H is such that some vertices are 
in more than one maximal clique. However, we are actually interested in the set of minimal clique 
vertex coverings for which each vertex only belongs to one clique, which can be extracted from 
H. See Figures |4] and [5] for an example. The set H is shown in Figure |4l From this set H , we 
can deduce the two minimal clique vertex coverings shown in [5j In initializing Algorithm [I] which 
imposes determinism on each covering, we find the set of all minimal coverings for which each 
vertex belongs to one clique. 
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Algorithm 2 -Finding all maximal cliques 
Input: Observed sequence y; Alphabet A, Integer L 

Setup: 

1: Call the set of strings that appear in y W = {Si,£ n }, so W is our set of vertices 
2: Determine / x | y (a) for each unique string 

3: Generate an incident matrix, U € R 7lxn , U,,- = Uij, where uy is defined in [5] 

4: P 4— [1, 2, ...,n] {P holds the prospective vertices connected to all vertices in R} 

5: R 0 {Holds the currently growing clique} 

6: X 0 {Holds the vertices that have already been processed} 

function Bron-Kerbosch(R,P,X) 

1: k<- 0 

2: if P = 0 and X = 0 then 

3: report R as a maximal clique, R = Ck 

4: k <— k + 1 

5: end if 

6: for each vertex v in P do 

7: Bron-Kerbosch(i?. Ut),Pn N(v), X fl N(v)) {where N(v) are the neighbors of u} 

8: P <- P\v 

9: X <— X Uv 

10: end for 

11: return {Ci, C 2 ,..., C m } 


Algorithm 3 -Minimal Clique Vertex Covering 
Input: Observed sequence y; Alphabet A, Integer L 
Set-up: 

1. CSSR 


run the CSSR Algorithm (Algorithm |T|) 
return k, the number of cliques found by CSSR 
2. Find all maximal cliques 
run Algorithm [2] 
return {Ci, C 2 ,..., Cm} 

Finding all minimal clique coverings: 

Let Uj = 1 denote string Si belonging to clique Cj 
Let Sij = 1 denote string Si being mapped to clique Cj 

' min yi {yi indicates whether clique Ci is being used or not} 


Q(C) = 


S .t. y, > ^ 

Vi < ^2 Sij 


^yi<k 


Sij ^ I{j 

J2 Si i = 1 

3 
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Once we have determined the set of minimal clique covers of our original graph where each vertex 
only belongs to one clique, we select a final clique covering which is minimal and deterministic. 
This is done by considering each minimal clique covering individually and then restructuring it 
when necessary to be deterministic. This corresponds to the reconstruction part of Algorithm [T] 
Note in Algorithm 2] each vertex corresponds to a string of length L, which came from an initial 
string y. Also recall that we have previously defined the binary variables z as 


zu = 



string i maps to string j upon symbol a 
else 


where a is any element of our alphabet A. The following proposition is clear from the construc¬ 
tion of Algorithms [3] and []] 


Proposition 5.3. Algorithm 0 along with Algorithm^ produces an encoding of a minimum state 
probabilistic finite machine capable of generating the input sequence. 



Figure 4: An example output from Algorithm [3] 



Figure 5: Minimal clique coverings where each vertex is only in one clique 
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Algorithm 4 - Minimal Clique Covering Reconstruction 
Input: Observed sequence y; Alphabet A, Integer L; 

Set-up: 

Perform Algorithm [2] on y and obtain a minimal clique covering 

From this initial clique covering, find the set of all minimal coverings such that each vertex belongs to exactly one 
clique, an example of which is seen in Figure [5] Let this set of minimal coverings be T = {7\,..., Ti} 

From y, A, and the set {Si,..., Sn} as found in Algorithmic determine the matrix z. 

Minimal Clique Covering Reconstruction: 

1: for h = 1 to l do 

2: i = 1 

3: N = the number of cliques in each {n}ote that all Th have the same number of cliques because they are all 

minimal 
4: M = N 

5: while i < M do 

6: N = M 

7: F = the set of all vertices (strings) which are in clique i 

8: if length(F) > 1 then 

9: for j = 2, 3,..., #{F} do 

10: Use z to find what clique F(j) maps to for each a £ A 

11: if F(j) does not map to the same clique as F( 1) upon each a then 

12: if M > N and F(j) maps to the same clique as some F(k ) upon each a, k £ [N + 1 ,M] then 

13: Add F(i) to the clique containing F(k) 

14: else 

15: Create a new clique containing F(i) 

16: M = M + 1 

17: end if 

18: end if 

19: end for 

20: end if 

21: i = i + 1 

22: end while 

23: end for 

24: FinalCovering = min{Ch\Ch has the minimum number of columns V/i} 
return FinalCovering 


6 Comparing run times of modified CSSR algorithms 

This paper has discussed three different approaches for determining minimal state hidden Markov 
models from a given input sequence: the CSSR algorithm, CSSR as an integer program, and the 
minimal clique covering reformulation. We now compare the run times of all three algorithms. 
We find that the CSSR algorithm has the fastest run time, however it does not always produce a 
minimal state model as we have seen in a prior example. The minimal clique covering reformulation 
is relatively fast and always results in a minimal state model. The integer program reformulation 
is extremely slow, but also always results in a minimal state model. This makes CSSR a useful 
heuristic for solving the NP-hard MSDpFSA problem. 

The CSSR integer program, using only a two-character alphabet and sequence of length ten as 
input, takes about 100 seconds to run. We can see in Figure [6] that the minimal clique covering 
reformulations takes less than 0.2 seconds for sequences up to length 10,000. For only two character 
alphabets, the run time appears to be a nearly linear function of the sequence length, with spikes 
occurring presumably in cases where more reconstruction was needed. In Figure [6] we see the run 
time of the minimal clique covering reformulation for a three-character alphabet. For sequences up 
to length 100, the algorithm took no more than 120 seconds, which is remarkably better than the 
integer program. It is much less linear than shown in Figure [6] because reconstruction is needed 
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more frequently for three characters. 


0.45 

2 Character Alphabet 


0.35 




Figure 6: Clique covering reformulation times 

For two-, three-, and 4-character alphabets, we also compare the minimal clique covering for¬ 
mulation time to that of the original CSSR algorithm. This can be seen in Figure [7] where we take 
the average of the run times for the CSSR and clique covering formulation to compare the two. 
Note that with a two-character alphabet, the clique covering formulation is slightly faster, but for 
the three- and four-character alphabets the CSSR algorithm is significantly faster. This is due to 
the fact that the CSSR algorithm does not actually guarantee a minimal state Markov model. 

7 Conclusion and Future Directions 

In this paper we illustrated the the problem stated by Shalizi and Crutchfield |5j of determining a 
minimal state probabilistic finite state representation of a data set is NP-hard for finite data sets. 
As a by-product, we formally proved this to be the case for the non-deterministic case as studied 
in m- As such, this shows that both the CSSR algorithm of [5] and CSSA algorithm of m can 
be thought of as low-order polynomial approximations of NP-hard problems. 

Future work in this area includes studying, in detail, these approximation algorithms for the 
problems to determine what there approximation properties are. In addition to this work, determin¬ 
ing weaknesses in this approach to modeling behavior are planned. We are particularly interested 
in the affects deception can have on models learned in this way. For example, we are interested in 
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Sequence length 



Figure 7: Minimal clique covering run times compared to CSSR 
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formulating a problem of constrained optimal deception in which a learner, using an algorithm like 
the one described here or the CSSR or CSSA approximations, is optimally confused by an input 
stream that is subject to certain constraints in its incorrectness. 


A Computing f q .\ y and / x)y 


The following formulas can be used to compute f q .| y and f ax i y in Algorithm [TJ Let #(x, y) be the 
number of times the sequence x is observed as a subsequence of y. 


/x| y (a) = Pr(a|x,y) 


#(xa,y) 

#(x,y) 


f qi | y (o) = Pr(a|%,y) 


#( xa > y) 
E xGg ,#( x ’y) 


( 12 ) 

(13) 
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